Introduction

I am interested in education data and predicting the success of students in better ways hopefully to reduce the amount of testing (sampling) that is needed for each student, allowing them and their teachers to spend more time on preductive learning instead of time test taking or preparing for said tests. The data set I chose to explore is the “Regulatory Adjusted Four-Year Cohort Graduation Rates, School Year 2011-12” datset from EDFact, an initiative set forth by the U.S. Department of Education to “put performance data at the center of policy, management, and budget decisions for all K-12 educational programs”. [1]

Setup

Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(maps)
## 
##  # ATTENTION: maps v3.0 has an updated 'world' map.        #
##  # Many country borders and names have changed since 1990. #
##  # Type '?world' or 'news(package="maps")'. See README_v3. #
library(gridExtra)

Loading and Cleaning The Data

The data has some values that need changing for my initial exploratory purposes. These values arise due to the need to protect student information under the Family Educational Rights and Privacy Act (FERPA). For instance, in order to maintain student anonymity, small samples (5 students or less) have their values omitted (replaced with “PS”). I will be ignoring these cells in calculations that use them. Also, medium sized samples (from 6 to 300 students) have varying degrees of “blurring”, where exact graduation percentages are not reported, but ranges are used instead. I replace these ranges with point estimates that are normally distributed, centered around the range’s midpoint, but never allowed to extend past the range’s maximum or minimum.

Additionally, since I’m only looking at the 2011-2012 school year, the 1112 identifier on many of the column names is unnecessary, so I rename those columns, removing that portion of the name. E.g., “ALL_COHORT_1112” becomes “ALL_COHORT”.

Note: All of the above changes were preprocessed and saved into the “graduation_rate_data.csv” file. To see how these calculations were performed, see my raw code file titled “P4_scratchpad.R”.

data <- read.csv("graduation_rate_data.csv")


#Some of the schools were resporting way more students than the others.
#I looked up the largest ever graduating class in the U.S. and it was 1561 students,
#so I removed all rows with cohorts over 1600.
#(https://en.wikipedia.org/wiki/Plano_East_Senior_High_School)
data <- data %>% filter(ALL_COHORT<1600)

I’m interested in what factors could be affecting the graduation rates, as well. So I found another data set[2] that has school data for pupil/teacher ratios and revenue data, which I suspect influence graduation rates.

school_districts <- read.csv("school_district_data.csv")

#Combine extra data from school_districts into main dataframe
data <- left_join(data, school_districts, by="leaid11")
data <- data %>% rename(subregion=County.Name)

#Add computed column for revenue per student since this is probably more predictive that total revenue
data$Total.General.Revenue.Per.Student <- data$Total.General.Revenue/data$Total.Students

#Remove Infinities
data <- data %>% mutate(Total.General.Revenue.Per.Student=ifelse(Total.General.Revenue.Per.Student==Inf, NA, Total.General.Revenue.Per.Student))

#Remove schools that have per student revenues of over the .999 quantile.
data <- data %>% filter(Total.General.Revenue.Per.Student<quantile(Total.General.Revenue.Per.Student, .999, na.rm=TRUE))

#Remove Schools that have over 100 pupils per teacher. Either these values are mistaken, or these schools have atypical class environments and I don't want them affecting my analysis.
data <- data %>% filter(Pupil.Teacher.Ratio<=100)

Defining the Hypothesis Variables

My initial hypothesis about the data is that one of three variables (or some combination thereof) has an effect of the Graduation Rate for a given school. The three variables I’m hypothesising to have an effect are

  • Number of Students per School
  • Pupil/Teacher Ratio
  • Revenue per Student

In the analysis and exploration that follows, I look primarily at these variables (henceforth referred to as the hypothesis variables). Although I will periodically look at other aspects of the data.

Univariate and Bivariate Exploration

The most important variable in my dataset is ALL_RATE, which is the percentage of students who graduated from each school (i.e., Graduation Rate). This is the most interesting variable because it’s the best way to operationalize success in terms of the U.S. education system teaching our young people. My intuition is that Number of Students Per School, Pupil/Teacher Ratio and Revenue Per Student are going to be contributing factors to the graduation rate, so I’m very interested in those, too.

Histograms

The first thing I do is create histograms for these four variables. Additionally, I plot the histogram of the Revenue Per Student variable with a log tranformation on the x-axis to better observe that variable’s distribution.

ggplot(aes(ALL_RATE), data=data) + xlab("Graduation Rate") + geom_histogram(fill="lightblue", color="black", binwidth=1)

ggplot(aes(ALL_COHORT), data=data) + xlab("Number of Students Per School") + geom_histogram(fill="orange", color="black", binwidth=1) 

ggplot(aes(Pupil.Teacher.Ratio), data=data) + xlab("Pupil Teacher Ratio") + geom_histogram(fill="purple", color="black", binwidth=1) + xlim(0,40)

ggplot(aes(Total.General.Revenue.Per.Student), data=data) + xlab("Revenue Per Student") + geom_histogram(fill="lightgreen", color="black", binwidth=10000)

ggplot(aes(Total.General.Revenue.Per.Student), data=data) + xlab("Revenue Per Student") + geom_histogram(fill="lightgreen", color="black", binwidth=.05) + scale_x_continuous(trans="log10")

The Graduation Rate distribution is somewhat normal, though it’s obviously cut off at 100 since it’s impossible for a percentage to go over 100, and it has a long left tail.

The Number of Students Per School distribution appears to be exponential, which was surprising to me. I would have expected it to be normal as well, with only a few small schools, many average-sized schools, and a few large schools. However, it turns out that’s not the case. When the number of students is divided by the number of teachers to yield the Pupil Teacher Ratio, then the data does fit a normal distribution, though.

The Revenue Per Student distribution isn’t obvious from the linear histrogram, but transforming the x axis to a log scale reveals a close to normal distribution that is a little right skewed. So the Revenue Per Student distribution is approximately log-normal.

For a numerical reference, I then called the summary function on each of these variables, so that I didn’t have to eyeball the defining statistical features.

summary(data$ALL_RATE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   73.00   86.00   77.94   92.76  100.00    1049
summary(data$ALL_COHORT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    36.0   102.0   173.6   270.0  1550.0
summary(data$Pupil.Teacher.Ratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.89   13.21   15.18   15.66   17.52   86.10
summary(data$Total.General.Revenue.Per.Student)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    9631   11330   13120   14290  129600

Correlations

data %>% select(ALL_RATE, ALL_COHORT, Pupil.Teacher.Ratio, Total.General.Revenue.Per.Student) %>% 
  rename(grad.rate=ALL_RATE, num.students=ALL_COHORT, pt.ratio=Pupil.Teacher.Ratio, rev.per.stud=Total.General.Revenue.Per.Student) %>% 
cor(use="na.or.complete")
##                grad.rate num.students   pt.ratio rev.per.stud
## grad.rate     1.00000000    0.2173667 -0.1186047  -0.03554802
## num.students  0.21736669    1.0000000  0.2879385  -0.10063393
## pt.ratio     -0.11860466    0.2879385  1.0000000  -0.39486433
## rev.per.stud -0.03554802   -0.1006339 -0.3948643   1.00000000

The correlations between the three hypothesis variables and graduation rate are not strong (less than 0.3 in magnitude).

The strongest of the three is the Number of Students per School, which was not at all my first guess for strongest predictor. A possible explanation for this is that having a lot of peers increases the pressure for doing well in school. Additionally, larger schools are more likely to be in urban areas, so there could be a lot of other confounding factors that come along with that trend.

The Pupil/Teacher Ratio is weaker still, but at least the direction of the correlation is to be expected. It’s a negative correlation, so as the number of students per teacher goes up, the graduation rate tends to go down. However, this is a very weak relationship.

Finally, the last variable on the list was the one I expected to be the strongest predictor of graduation rate. Revenue Per Student has a paltry correlation constant of -0.035. There must be other factors at play here, because there is a significant negative correlation between Revenue Per Student and Pupil/Teacher Ratio. The two variables have a -0.372 correlation, so as Revenue Per Student goes up, there are fewer students for every teacher, and having fewer students per teacher is somewhat negatively correlated with higher graduation rates, but that chain of logic isn’t strong enough for there to be a correlation between Revenue Per Student and Graduation Rate.

Data exploration is about uncovering knowledge, and that knowledge is valuable if it confronts our preconceived notions about how something works. Usually, one thinks that this means showing a relationship between features that was previously undiscovered. However, it could just as easily be a demonstration that no relationship exists where “common sense” would lead one to believe a relationship exists. Based on these low correlation values, it appears that none of my hypothesis variables are strong predictors for Graduation Rate. For the remainder of my analysis, I will work on exploring this lack of a relationship.

Boxplots (by state)

Next, I want to do some regional comparisons. I start out with box plots of the four variables we’ve looked at so far by state.

ggplot(aes(x=Postal.Code, y=ALL_RATE), data=data) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 270, hjust = 1)) 
## Warning: Removed 1049 rows containing non-finite values (stat_boxplot).

ggplot(aes(x=Postal.Code, y=ALL_COHORT), data=data) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 270, hjust = 1)) 

ggplot(aes(x=Postal.Code, y=Pupil.Teacher.Ratio), data=data) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 270, hjust = 1))

ggplot(aes(x=Postal.Code, y=Total.General.Revenue.Per.Student), data=data) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 270, hjust = 1)) 

Scatter Plots

Next, I took the three hypothesis variables and made a scatter plot for each agains Graduation Rate, along with a linear regression and a Generalized Additive Model (the default for geom_smooth when n>1000) to better see the effect on Graduation Rate that each hypothesis variable has as it varies.

Number of Students per School
ggplot(aes(x=ALL_COHORT, y=ALL_RATE), data=data) + 
  geom_point(alpha=0.10) + 
  xlim(0, quantile(data$ALL_COHORT, .99, na.rm=TRUE)) +
  geom_smooth(method="lm", color="red")
## Warning: Removed 1242 rows containing missing values (stat_smooth).
## Warning: Removed 1242 rows containing missing values (geom_point).

ggplot(aes(x=ALL_COHORT, y=ALL_RATE), data=data) + 
  geom_point(alpha=0.10) + 
  xlim(0, quantile(data$ALL_COHORT, .99, na.rm=TRUE)) +
  geom_smooth(color="red")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 1242 rows containing missing values (stat_smooth).
## Warning: Removed 1242 rows containing missing values (geom_point).

When plotting the Number of Students on the x axis and Graduation Rate on the y axis, we see a slight positive correlation, which matches the correlation coefficient we saw earlier. When we relax our model from being linear, we see that much of the increase occurs when the Number of Students is less than 100, and then it levels off a lot. Also, the banding that we see for cohort sizes under 200 is an effect of the rounding that occurs from the “blurring” effect applied to small schools discussed in “Loading and Cleaning The Data” section.

Pupil/Teacher Ratio
ggplot(aes(x=Pupil.Teacher.Ratio, y=ALL_RATE), data=data) + 
  geom_point(alpha=0.10) + 
  xlim(0, 75) +
  geom_smooth(method="lm", color="red")
## Warning: Removed 1050 rows containing missing values (stat_smooth).
## Warning: Removed 1050 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=ALL_RATE), data=data) + 
  geom_point(alpha=0.10) + 
  xlim(0, 75) +
  geom_smooth(color="red")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 1050 rows containing missing values (stat_smooth).
## Warning: Removed 1050 rows containing missing values (geom_point).

Looking at the scatterplot for Graduation Rate vs Pupil/Teacher Ratio, the relationship is even less clear. The linear model suggests a slight negative correlation, like we saw earlier, but the nonlinear model shows a complicated relationship, peaking around 15 students per teacher, and then decreasing as you move toward 20 students per teacher. Then there is a slight bump at about 22 teachers per student, but that’s probably because there are more schools at that level (likely due to state maximums for class size), which can be observed from the vertical banding at those levels. Also of note is the high variace around the model at the very low and very high Pupil/Teacher Rations. This is likely due to the fewer number of schools at these levels.

Revenue per Student
ggplot(aes(x=Total.General.Revenue.Per.Student, y=ALL_RATE), data=data) +
  geom_point(alpha=0.10) + 
  xlim(0, quantile(data$Total.General.Revenue.Per.Student, .99, na.rm=TRUE)) +
  geom_smooth(method="lm", color="red")
## Warning: Removed 1179 rows containing missing values (stat_smooth).
## Warning: Removed 1179 rows containing missing values (geom_point).

ggplot(aes(x=Total.General.Revenue.Per.Student, y=ALL_RATE), data=data) +
  geom_point(alpha=0.10) + 
  xlim(0, quantile(data$Total.General.Revenue.Per.Student, .99, na.rm=TRUE)) +
  geom_smooth(color="red")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 1179 rows containing missing values (stat_smooth).
## Warning: Removed 1179 rows containing missing values (geom_point).

The linear model is the flattest for Revenue Per Student, which again matches the correlation coefficient we observed earlier, indicating that Revenue Per Student has the least predictive power of the three hypothesis variables. Looking at the nonlinear model, it’s even less clear what kind of relationship this hypothesis variable has with Graduation Rate. Most surprising to me is the drop off as you move right past about $22,000 per student.

Heatmaps

While the boxplots and scatterplots lay out the data informatively, it’s harder for laypeople to glean any relationships from it because there is so much information in each plot. To make something that’s easier for the human eye to interpret, I want to map out the states geographically, since every U.S. citizen should be comfortable with that representation. First, I have to get each state’s latitude and longitude data. Then, I’ll group the school data by state. Finally, I’ll plot a heat map of each variable.

state_coords <- read.csv("state_coords.csv")

data.state_means <- data %>% 
  group_by(region) %>% 
  summarise(ALL_RATE = mean(ALL_RATE, na.rm=TRUE),
            ALL_COHORT = mean(ALL_COHORT, na.rm=TRUE),
            MAM_RATE = mean(MAM_RATE, na.rm=TRUE),
            MAM_COHORT = mean(MAM_COHORT, na.rm=TRUE),
            MAS_RATE = mean(MAS_RATE, na.rm=TRUE),
            MAS_COHORT = mean(MAS_COHORT, na.rm=TRUE),
            MBL_RATE = mean(MBL_RATE, na.rm=TRUE),
            MBL_COHORT = mean(MBL_COHORT, na.rm=TRUE),
            MHI_RATE = mean(MHI_RATE, na.rm=TRUE),
            MHI_COHORT = mean(MHI_COHORT, na.rm=TRUE),
            MTR_RATE = mean(MTR_RATE, na.rm=TRUE),
            MTR_COHORT = mean(MTR_COHORT, na.rm=TRUE),
            MWH_RATE = mean(MWH_RATE, na.rm=TRUE),
            MWH_COHORT = mean(MWH_COHORT, na.rm=TRUE),
            CWD_RATE = mean(CWD_RATE, na.rm=TRUE),
            CWD_COHORT = mean(CWD_COHORT, na.rm=TRUE),
            ECD_RATE = mean(ECD_RATE, na.rm=TRUE),
            ECD_COHORT = mean(ECD_COHORT, na.rm=TRUE),
            LEP_RATE = mean(LEP_RATE, na.rm=TRUE),
            LEP_COHORT = mean(LEP_COHORT, na.rm=TRUE),
            Full.Time.Teachers = mean(Full.Time.Teachers, na.rm=TRUE),
            Pupil.Teacher.Ratio = mean(Pupil.Teacher.Ratio, na.rm=TRUE),
            Total.General.Revenue = mean(Total.General.Revenue, na.rm=TRUE),
            Total.General.Revenue.Per.Student = mean(Total.General.Revenue.Per.Student, na.rm=TRUE),
            n = n(),
            Postal.Code = first(Postal.Code)) %>% 
  arrange(ALL_RATE)

state_data <- full_join(state_coords, data.state_means, by="region")
## Warning in outer_join_impl(x, y, by$x, by$y): joining factors with
## different levels, coercing to character vector
ggplot() + geom_polygon(data=state_data, aes(x=long, y=lat, group = group, fill=state_data$ALL_RATE),colour="white") + scale_fill_continuous(low = "thistle2", high = "darkred", guide="colorbar") + xlim(-130, -65)

ggplot() + geom_polygon(data=state_data, aes(x=long, y=lat, group = group, fill=state_data$ALL_COHORT),colour="white") + scale_fill_continuous(low = "thistle2", high = "orange", guide="colorbar") + xlim(-130, -65)

ggplot() + geom_polygon(data=state_data, aes(x=long, y=lat, group = group, fill=state_data$Pupil.Teacher.Ratio),colour="white") + scale_fill_continuous(low = "thistle2", high = "darkblue", guide="colorbar") + xlim(-130, -65)

ggplot() + geom_polygon(data=state_data, aes(x=long, y=lat, group = group, fill=state_data$Total.General.Revenue.Per.Student),colour="white") + scale_fill_continuous(low = "thistle2", high = "darkgreen", guide="colorbar", trans="log10") + xlim(-130, -65)

These heat maps are much more useful for comparing the four variables, especially for the untrained, because of their familiarity. Looking at the four maps, it’s again clear that the correlations between the three predictive variables and the Graduation Rates are less strong that I originally hypothesized.

For a more granular look, I will plot the same data on a county-by-county basis instead of by state so that the data isn’t as heavily grouped, especially in the larger states.

all_counties <- read.csv("county_coords.csv")

data.county_means <- data %>% 
  group_by(region, subregion) %>% 
  summarise(ALL_RATE = mean(ALL_RATE, na.rm=TRUE),
            ALL_COHORT = mean(ALL_COHORT, na.rm=TRUE),
            MAM_RATE = mean(MAM_RATE, na.rm=TRUE),
            MAM_COHORT = mean(MAM_COHORT, na.rm=TRUE),
            MAS_RATE = mean(MAS_RATE, na.rm=TRUE),
            MAS_COHORT = mean(MAS_COHORT, na.rm=TRUE),
            MBL_RATE = mean(MBL_RATE, na.rm=TRUE),
            MBL_COHORT = mean(MBL_COHORT, na.rm=TRUE),
            MHI_RATE = mean(MHI_RATE, na.rm=TRUE),
            MHI_COHORT = mean(MHI_COHORT, na.rm=TRUE),
            MTR_RATE = mean(MTR_RATE, na.rm=TRUE),
            MTR_COHORT = mean(MTR_COHORT, na.rm=TRUE),
            MWH_RATE = mean(MWH_RATE, na.rm=TRUE),
            MWH_COHORT = mean(MWH_COHORT, na.rm=TRUE),
            CWD_RATE = mean(CWD_RATE, na.rm=TRUE),
            CWD_COHORT = mean(CWD_COHORT, na.rm=TRUE),
            ECD_RATE = mean(ECD_RATE, na.rm=TRUE),
            ECD_COHORT = mean(ECD_COHORT, na.rm=TRUE),
            LEP_RATE = mean(LEP_RATE, na.rm=TRUE),
            LEP_COHORT = mean(LEP_COHORT, na.rm=TRUE),
            Full.Time.Teachers = mean(Full.Time.Teachers, na.rm=TRUE),
            Pupil.Teacher.Ratio = mean(Pupil.Teacher.Ratio, na.rm=TRUE),
            Total.General.Revenue = mean(Total.General.Revenue, na.rm=TRUE),
            Total.General.Revenue.Per.Student = mean(Total.General.Revenue.Per.Student, na.rm=TRUE),
            n = n()) %>% 
  ungroup() %>% 
  arrange(ALL_RATE)

county_data <- left_join(all_counties, data.county_means, by=c("region", "subregion"))
## Warning in left_join_impl(x, y, by$x, by$y): joining factors with different
## levels, coercing to character vector
## Warning in left_join_impl(x, y, by$x, by$y): joining factors with different
## levels, coercing to character vector
ggplot() + geom_polygon(data=county_data, aes(x=long, y=lat, group = group, fill=county_data$ALL_RATE),colour="white") + scale_fill_continuous(low = "thistle2", high = "darkred", guide="colorbar")

ggplot() + geom_polygon(data=county_data, aes(x=long, y=lat, group = group, fill=county_data$ALL_COHORT),colour="white") + scale_fill_continuous(low = "thistle2", high = "orange", guide="colorbar", trans="log10")

ggplot() + geom_polygon(data=county_data, aes(x=long, y=lat, group = group, fill=county_data$Pupil.Teacher.Ratio),colour="white") + scale_fill_continuous(low = "thistle2", high = "darkblue", guide="colorbar")

ggplot() + geom_polygon(data=county_data, aes(x=long, y=lat, group = group, fill=county_data$Total.General.Revenue.Per.Student),colour="white") + scale_fill_continuous(low = "thistle2", high = "darkgreen", guide="colorbar", trans="log10")

Even though these maps are more details, it’s even harder to eyeball correlations between them because the regions are so much smaller. There is definitely a need for a more encompassing, multivariate exploration.

Multivariate Analysis

While some relationships can be seen by comparing multiple plots above, I now plan to find some deeper relationships by combining multiple variables into single plots.

Exploring Sub-Cohorts

One aspect of the data that I’ve neglected thus far is the breakout of graduation rates among various groups of students:

ALL: All students combined

MAM: Native American students

MAS: Asian/Pacific Islander students

MBL: Black students

MHI: Hispanic students

MTR: Studens with two or more races

MWH: White students

CWD: Students with disabilities

ECD: Economically disadvantaged students

LEP: Students with limited English proficiency

To compare all of these cohorts with the overall graduation rate, I plotted a line graph connecting each state’s graduation rates for each cohort, ordered by the overall graduation rate (the black line).

data.state_means$Postal.Code <- factor(data.state_means$Postal.Code, as.character(data.state_means$Postal.Code))

ggplot(aes(x=Postal.Code), data=filter(data.state_means, !is.na(Postal.Code))) + theme(axis.text.x = element_text(angle = 270, hjust = 1)) +
  geom_line(aes(y=ALL_RATE, group=2), color="black") +
  geom_point(aes(y=MAM_RATE), color="red") + geom_line(aes(y=MAM_RATE, group=2), color="red", alpha=0.5) + 
  geom_point(aes(y=MAS_RATE), color="orange") + geom_line(aes(y=MAS_RATE, group=2), color="orange", alpha=0.5) + 
  geom_point(aes(y=MBL_RATE), color="yellow") + geom_line(aes(y=MBL_RATE, group=2), color="yellow", alpha=0.5) + 
  geom_point(aes(y=MHI_RATE), color="green") + geom_line(aes(y=MHI_RATE, group=2), color="green", alpha=0.5) + 
  geom_point(aes(y=MTR_RATE), color="lightblue") + geom_line(aes(y=MTR_RATE, group=2), color="lightblue", alpha=0.5) + 
  geom_point(aes(y=MWH_RATE), color="blue") + geom_line(aes(y=MWH_RATE, group=2), color="blue", alpha=0.5) + 
  geom_point(aes(y=CWD_RATE), color="darkblue") + geom_line(aes(y=CWD_RATE, group=2), color="darkblue", alpha=0.5) + 
  geom_point(aes(y=ECD_RATE), color="purple") + geom_line(aes(y=ECD_RATE, group=2), color="purple", alpha=0.5) + 
  geom_point(aes(y=LEP_RATE), color="magenta") + geom_line(aes(y=LEP_RATE, group=2), color="magenta", alpha=0.5) 
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 13 rows containing missing values (geom_point).

Since these states are ordered by the overall graduation rate, I was looking for an upward or downward trend in any of the student cohorts to see if a particular student group either bolstered or undermined the total graduation rate. The strongest upward trend was for white students, but this alone doens’t mean that white students are a better indicator of a school’s success, since it’s likely that most schools are predominantly white, so I went on to graph the ratio of each cohort with the total student population to see if that was the case.

I start with the racial groups first, and then plot the other groups second.

ggplot(aes(x=Postal.Code), data=filter(data.state_means, !is.na(Postal.Code))) + theme(axis.text.x = element_text(angle = 270, hjust = 1)) +
  geom_line(aes(y=MAM_COHORT/ALL_COHORT, group=2, color="Native Am")) + 
  geom_line(aes(y=MAS_COHORT/ALL_COHORT, group=2, color="Asian")) + 
  geom_line(aes(y=MBL_COHORT/ALL_COHORT, group=2, color="Black")) + 
  geom_line(aes(y=MHI_COHORT/ALL_COHORT, group=2, color="Hispanic")) + 
  geom_line(aes(y=MTR_COHORT/ALL_COHORT, group=2, color="2+ Races")) + 
  geom_line(aes(y=MWH_COHORT/ALL_COHORT, group=2, color="White")) +
  ylab("") + scale_colour_brewer(palette="Dark2", name="Cohorts")

ggplot(aes(x=Postal.Code), data=filter(data.state_means, !is.na(Postal.Code))) + theme(axis.text.x = element_text(angle = 270, hjust = 1)) +
  geom_line(aes(y=CWD_COHORT/ALL_COHORT, group=2, color="Disabled")) + 
  geom_line(aes(y=ECD_COHORT/ALL_COHORT, group=2, color="Econ Disadv")) + 
  geom_line(aes(y=LEP_COHORT/ALL_COHORT, group=2, color="Limited Eng")) +
  ylab("") + scale_colour_brewer(palette="Dark2", name="Cohorts")

In the race cohort plot, it’s clear that white students predominate, especially in Alaska, where erroneously there are more than 100% white students. But the gap between white students and other races doesn’t grow much as you go up in the rankings. Plus, states like California and Texas have good rankings and actually have more Hispanic students than White students.

In the other plot, while I was surprised to see the percentage of economically disadvantaged students across the nation as a whole, none of these three cohorts rise or fall that regularly as you go up in the states’ rankings.

The previous plots are a little hard to interpret, so I think a better way to demonstrate the racial make up of each state’s students is a stacked bar chart. I ordered the states by total number of students who reported their race, and within each bar, I ordered the fills by the largest race in each particular state.

sub_cohorts <- select(data.state_means, c(-ALL_RATE, -MAM_RATE, -MAS_RATE, -MBL_RATE, -MHI_RATE, -MTR_RATE, -MWH_RATE, -CWD_RATE, -ECD_RATE, -LEP_RATE, -CWD_COHORT, -ECD_COHORT, -LEP_COHORT)) %>% 
  gather(DEMO, COHORT, MAM_COHORT, MAS_COHORT, MBL_COHORT, MHI_COHORT, MTR_COHORT, MWH_COHORT) %>% 
  filter(!is.na(Postal.Code))

sub_cohorts.group <- sub_cohorts %>% group_by(region) %>% 
  summarise(cohort.sum = sum(COHORT, na.rm=TRUE))

sub_cohorts <- left_join(sub_cohorts, sub_cohorts.group, by="region")

sub_cohorts <- sub_cohorts %>% arrange(cohort.sum, desc(COHORT)) 

sub_cohorts$Postal.Code <- factor(sub_cohorts$Postal.Code, as.character(sub_cohorts$Postal.Code))
## Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
## else paste0(labels, : duplicated levels in factors are deprecated
ggplot(aes(x=Postal.Code, y=COHORT), data=sub_cohorts) + theme(axis.text.x = element_text(angle = 270, hjust = 1)) +
  geom_bar(aes(fill=DEMO), position="stack", stat="identity") 
## Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
## else paste0(labels, : duplicated levels in factors are deprecated
## Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
## else paste0(labels, : duplicated levels in factors are deprecated
## Warning: Removed 14 rows containing missing values (position_stack).

This representation really shows by how much White students are the predominant race, and the close relationship between white students’ graduation rates and total graduation rates is a result of this fact.

Comparing the Hypothesis Variables to Each Other and Graduation Rate

While these cohort differences are interesting, I’m going to now refocus back on my main hypothesis variables, namely, how is Graduation Rate affected by Number of Students per School, Pupil/Teacher Ratios, and Revenue per Student levels.

I tried a number of visualization combinations for Graduation Rate and the three hypothesis variables, but the data is clustered in such a way that I wasn’t able to show much. Here are some examples of those plots.

ggplot(aes(x=ALL_COHORT, y=ALL_RATE), data=data) + 
  geom_point(aes(color=Total.General.Revenue.Per.Student), alpha=0.05) + 
  xlim(0, 1200)
## Warning: Removed 1056 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=ALL_RATE), data=data) + 
  geom_point(aes(color=Total.General.Revenue.Per.Student), alpha=0.05) +
  xlim(0, 40)
## Warning: Removed 1077 rows containing missing values (geom_point).

Due to the way my data is distributed, these two plots are very hard to interpret. This particular perspective has a lot of clustering, both spacially and chromatically (with many of the data points being on the dark end of the spectrum).

ggplot(aes(x=Pupil.Teacher.Ratio, y=ALL_RATE), data=data) + 
  geom_point(aes(size=ALL_COHORT), alpha=0.05) + 
  xlim(0, 40)
## Warning: Removed 1077 rows containing missing values (geom_point).

This plot tries to use size instead of color to combine two of the hypothesis variables and compare them to Graduation Rate. Still, this plot is nearly indecipherable.

ggplot(aes(x=Pupil.Teacher.Ratio, y=ALL_RATE), data=data) + 
  geom_point(aes(size=Total.General.Revenue.Per.Student, color=region), alpha=0.15) + 
  xlim(0, 40) + theme(legend.position="none")
## Warning: Removed 1077 rows containing missing values (geom_point).

This plot colors each data point by state, but there’s too many colors to distinguish from, and the legend didn’t even fit on the plot, so I removed it altogether.

I’m starting to realize that the data is clustered in a way that makes multivariate analysis difficult. So I decide to do a few transformations that will make it easier to interpret.

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), data=data) + 
  geom_point(aes(size=ALL_COHORT), alpha=0.1) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) 
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

This is already a nicer graph to look at, with a nice crescent shape. It also nicely shows how the three hypothesis variables relate to each other. It makes sense that the data exists in three quadrants. The empty, upper-right quadrant would be schools that spend a lot of money per student, yet still have very few teachers per student, which doesn’t seem like a likely scenario.

However, this graph is limited because it doens’t relate any of the variables to Graduation Rate, which as I’ve said is the most important variable to consider in this data set. Since I didn’t have much luck with all of the data as a whole, I decided to try something that worked pretty well earlier when I was plotting the heatmaps of the variables, and do a state by state comparison, singling out states one by one to suss out the relationships between the variables.

I start by ordering the states by Graduation Rate, so that I can look at the top and bottom states and show the relationship (or lackthereof) between that value and the other variables.

print.data.frame(
  data %>% group_by(region) %>% 
    summarize(grad.rate=mean(ALL_RATE, na.rm=TRUE)) %>% 
    arrange(desc(grad.rate))
)
##                  region grad.rate
## 1                  IOWA  88.26490
## 2         NEW HAMPSHIRE  87.42183
## 3              NEBRASKA  86.89516
## 4            NEW JERSEY  86.69148
## 5               VERMONT  86.57090
## 6             TENNESSEE  85.93794
## 7              MISSOURI  85.83430
## 8                KANSAS  85.59288
## 9          PENNSYLVANIA  85.56588
## 10          CONNECTICUT  85.50281
## 11             ARKANSAS  84.62547
## 12              INDIANA  84.25785
## 13             NEW YORK  84.14314
## 14                TEXAS  84.09530
## 15                MAINE  83.98745
## 16            WISCONSIN  83.59072
## 17         NORTH DAKOTA  83.30064
## 18             ILLINOIS  82.68784
## 19         SOUTH DAKOTA  82.49230
## 20       NORTH CAROLINA  82.31471
## 21             VIRGINIA  82.25354
## 22        MASSACHUSETTS  82.04639
## 23              MONTANA  81.03929
## 24                 OHIO  80.73274
## 25        WEST VIRGINIA  80.59427
## 26               HAWAII  80.26880
## 27           CALIFORNIA  80.03340
## 28         RHODE ISLAND  77.21854
## 29            LOUISIANA  76.33099
## 30          MISSISSIPPI  75.77132
## 31              ALABAMA  75.55576
## 32              WYOMING  74.84923
## 33             MARYLAND  74.75619
## 34                 UTAH  74.71588
## 35       SOUTH CAROLINA  73.21392
## 36             DELAWARE  72.14964
## 37              ARIZONA  70.70917
## 38             COLORADO  69.96055
## 39             MICHIGAN  69.42670
## 40            MINNESOTA  68.29072
## 41           NEW MEXICO  67.87550
## 42               OREGON  66.40423
## 43              GEORGIA  66.32037
## 44           WASHINGTON  65.63084
## 45 DISTRICT OF COLUMBIA  59.81687
## 46               NEVADA  59.58883
## 47              FLORIDA  56.63701
## 48               ALASKA  56.44120

Here are the graphs of the three hypothesis variables (Pupil/Teacher Ratio on the x axis, Revenue per Student on the y axis, and Number of Students per School for the size of each point) plotted by state. For each plot, the highlighted state is in blue, and all the other states’ data is in a salmon color with translucency added.

The first ten plots are the top ten states ranked by their Graduation Rate.

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="IOWA")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="IOWA"), 
                 alpha=ifelse(region=="IOWA", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Iowa (#1 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="NEW HAMPSHIRE")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="NEW HAMPSHIRE"), 
                 alpha=ifelse(region=="NEW HAMPSHIRE", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("New Hampshire (#2 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="NEBRASKA")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="NEBRASKA"), 
                 alpha=ifelse(region=="NEBRASKA", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Nebraska (#3 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="NEW JERSEY")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="NEW JERSEY"), 
                 alpha=ifelse(region=="NEW JERSEY", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("New Jersey (#4 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="VERMONT")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="VERMONT"), 
                 alpha=ifelse(region=="VERMONT", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Vermont (#5 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="TENNESSEE")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="TENNESSEE"), 
                 alpha=ifelse(region=="TENNESSEE", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Tennessee (#6 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="MISSOURI")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="MISSOURI"), 
                 alpha=ifelse(region=="MISSOURI", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Missouri (#7 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="KANSAS")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="KANSAS"), 
                 alpha=ifelse(region=="KANSAS", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Kansas (#8 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="PENNSYLVANIA")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="PENNSYLVANIA"), 
                 alpha=ifelse(region=="PENNSYLVANIA", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Pennsylvania (#9 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="CONNECTICUT")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="CONNECTICUT"), 
                 alpha=ifelse(region=="CONNECTICUT", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Connecticut (#10 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

Nearly all of these schools have low Pupil/Teacher Ratios, whereas the Revenue per Student varies more. It’s notable that none of the three hypothesis variables are very consistent from state to state, which reinforces my narrative that none of these variables is an especially good predictor of Graduation Rate.

The next ten plots are the bottom ten states when ranked by Graduation Rate.

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="MICHIGAN")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="MICHIGAN"), 
                 alpha=ifelse(region=="MICHIGAN", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Michigan (#39 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="MINNESOTA")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="MINNESOTA"), 
                 alpha=ifelse(region=="MINNESOTA", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Minnesota (#40 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="NEW MEXICO")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="NEW MEXICO"), 
                 alpha=ifelse(region=="NEW MEXICO", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("New Mexico (#41 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="OREGON")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="OREGON"), 
                 alpha=ifelse(region=="OREGON", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Oregon (#42 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="GEORGIA")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="GEORGIA"), 
                 alpha=ifelse(region=="GEORGIA", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Georgia (#43 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="WASHINGTON")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="WASHINGTON"), 
                 alpha=ifelse(region=="WASHINGTON", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Washington (#44 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="DISTRICT OF COLUMBIA")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="DISTRICT OF COLUMBIA"), 
                 alpha=ifelse(region=="DISTRICT OF COLUMBIA", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("DC (#45 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="NEVADA")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="NEVADA"), 
                 alpha=ifelse(region=="NEVADA", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Nevada (#46 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="FLORIDA")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="FLORIDA"), 
                 alpha=ifelse(region=="FLORIDA", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Florida (#47 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

ggplot(aes(x=Pupil.Teacher.Ratio, y=Total.General.Revenue.Per.Student), 
       data=arrange(data, region=="ALASKA")) + 
  geom_point(aes(size=ALL_COHORT, color=(region=="ALASKA"), 
                 alpha=ifelse(region=="ALASKA", 1, 0.05))) +
  scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10") +
  xlim(5, 40) + ylim(5000, 50000) + ggtitle("Alaska (#48 in Grad Rate)")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 209 rows containing missing values (geom_point).

These states tend to have a higher Pupil/Teacher Ratio, but the Revenue per Student doesn’t have as strong of a trend. Some of these bottom states even look remarkingly similar to the top states. Take Alaska, which ranks last in Graduation Rate, and it’s in that low Pupil/Teach Ration and high Revenue per Student quartile, like many of the top states.

These lower ranked states have more within-state variation, which means that there are some schools bringing down the state average, but it’s not common that a majority of the schools in a state are all performing poorly.

I decided to test this hypothesis that lower ranking states have more variation in Graduation Rate on a school-to-school basis by ranking the states by this parameter.

state.grad_rate_variations <- data %>% group_by(region) %>% 
  summarize(grad.rate.variation=var(ALL_RATE, na.rm=TRUE), 
            grad.rate.mean=mean(ALL_RATE, na.rm=TRUE)) %>% 
  arrange(desc(grad.rate.variation))

print.data.frame(state.grad_rate_variations)
##                  region grad.rate.variation grad.rate.mean
## 1               FLORIDA          1085.89435       56.63701
## 2             MINNESOTA           911.69205       68.29072
## 3              MICHIGAN           905.07762       69.42670
## 4            WASHINGTON           792.13320       65.63084
## 5                ALASKA           770.84843       56.44120
## 6                NEVADA           745.72941       59.58883
## 7  DISTRICT OF COLUMBIA           730.99121       59.81687
## 8               ARIZONA           704.20157       70.70917
## 9              COLORADO           702.61280       69.96055
## 10             MARYLAND           666.78627       74.75619
## 11                 OHIO           592.61754       80.73274
## 12             DELAWARE           581.63293       72.14964
## 13               OREGON           530.50594       66.40423
## 14              GEORGIA           525.60691       66.32037
## 15           CALIFORNIA           491.32554       80.03340
## 16                 UTAH           459.22952       74.71588
## 17           NEW MEXICO           422.44132       67.87550
## 18            WISCONSIN           414.42036       83.59072
## 19        MASSACHUSETTS           406.11925       82.04639
## 20              ALABAMA           388.31329       75.55576
## 21              WYOMING           386.69483       74.84923
## 22                TEXAS           351.25127       84.09530
## 23       SOUTH CAROLINA           328.42452       73.21392
## 24         PENNSYLVANIA           266.59493       85.56588
## 25              INDIANA           256.41295       84.25785
## 26         SOUTH DAKOTA           254.44447       82.49230
## 27         RHODE ISLAND           241.98595       77.21854
## 28           NEW JERSEY           237.99696       86.69148
## 29               KANSAS           237.24395       85.59288
## 30            TENNESSEE           233.37246       85.93794
## 31              MONTANA           230.71152       81.03929
## 32            LOUISIANA           224.10395       76.33099
## 33             NEW YORK           218.09788       84.14314
## 34             MISSOURI           213.78781       85.83430
## 35                 IOWA           212.05583       88.26490
## 36          CONNECTICUT           207.59165       85.50281
## 37       NORTH CAROLINA           200.38411       82.31471
## 38             ILLINOIS           193.71166       82.68784
## 39               HAWAII           190.31866       80.26880
## 40         NORTH DAKOTA           185.73402       83.30064
## 41             ARKANSAS           145.58466       84.62547
## 42             VIRGINIA           120.51495       82.25354
## 43          MISSISSIPPI           117.76825       75.77132
## 44                MAINE           108.90805       83.98745
## 45             NEBRASKA            98.58361       86.89516
## 46        NEW HAMPSHIRE            73.70071       87.42183
## 47              VERMONT            66.69807       86.57090
## 48        WEST VIRGINIA            61.00055       80.59427
with(state.grad_rate_variations, cor(grad.rate.variation, grad.rate.mean))
## [1] -0.829234

So the variation of Graduation Rate is very strongly negatively correlated with mean Graduation Rate. This makes a lot of sense because there’s a hard limit of a 100% Graduation Rate, so for schools to be at the top, most of their schools have to be near 100%. However, it does add some knowledge, because it could be the case that the states at the bottom have a lot of poorly performing schools, but this correlation indicates that those states actually have a mix of strong and weak schools.

Slope Plots of State Ranks

Finally, I want to see how the state rankings by the hypothesis variables stack up agains the rankings by Graduation Rate. So I first compute those three new rankings, and then plot a slope plot for each to show how the rankings change. If any of the hypothesis variables were good predictors, then the slopes of the lines would be small, meaning that a ranking of the hypothesis variable is closer to the Graduation Rate ranking. That is, the states at the top of one ranking are in the top of the other ranking, and the same for the bottom states.

sizeRanks <- data %>% group_by(Postal.Code) %>% 
  summarize(size=mean(ALL_COHORT, na.rm=TRUE)) %>% 
  arrange(desc(size)) %>% 
  mutate(size.rank=row_number())
sizeRanks <- sizeRanks %>% mutate(size.rank=dim(sizeRanks)[1]-size.rank)

ratioRanks <- data %>% group_by(Postal.Code) %>% 
  summarize(ratio=mean(Pupil.Teacher.Ratio, na.rm=TRUE)) %>% 
  arrange(desc(ratio)) %>% 
  mutate(ratio.rank=row_number())
ratioRanks <- ratioRanks %>% mutate(ratio.rank=dim(ratioRanks)[1]-ratio.rank)

revRanks <- data %>% group_by(Postal.Code) %>% 
  summarize(rev.per.student=mean(Total.General.Revenue.Per.Student, na.rm=TRUE)) %>% 
  arrange(desc(rev.per.student)) %>% 
  mutate(rev.per.student.rank=row_number())
revRanks <- revRanks %>% mutate(rev.per.student.rank=dim(revRanks)[1]-rev.per.student.rank)

gradRanks <- data %>% group_by(Postal.Code) %>% 
  summarize(grad.rate=mean(ALL_RATE, na.rm=TRUE)) %>% 
  arrange(desc(grad.rate)) %>% 
  mutate(grad.rate.rank=row_number())
gradRanks <- gradRanks %>% mutate(grad.rate.rank=dim(gradRanks)[1]-grad.rate.rank)

sizeRanks <- inner_join(sizeRanks, gradRanks, by="Postal.Code")
ratioRanks <- inner_join(ratioRanks, gradRanks, by="Postal.Code")
revRanks <- inner_join(revRanks, gradRanks, by="Postal.Code")
p<-ggplot(sizeRanks) + geom_segment(aes(x=0,xend=1,y=size.rank,yend=grad.rate.rank),size=.75) + scale_x_continuous(breaks = c())
p<-p + theme(panel.background = element_blank())
p<-p + theme(panel.grid=element_blank())
p<-p + theme(axis.ticks=element_blank())
p<-p + theme(axis.text=element_blank())
p<-p + theme(panel.border=element_blank())     
p<-p + geom_text(label=sizeRanks$Postal.Code, y=sizeRanks$grad.rate.rank, x=1,hjust=-0.2,size=3.5)
p<-p + geom_text(label=sizeRanks$Postal.Code, y=sizeRanks$size.rank, x=0,hjust=1.2,size=3.5)
p<-p + geom_text(label="Revenue Per Student Ranks", x=0, y=(1.05*(max(sizeRanks$grad.rate.rank,sizeRanks$size.rank))),hjust= 1.2,size=3.5)
p<-p + geom_text(label="Graduation Rate Ranks", x=1, y=(1.05*(max(sizeRanks$grad.rate.rank,sizeRanks$size.rank))),hjust=-0.1,size=3.5)
p

p<-ggplot(ratioRanks) + geom_segment(aes(x=0,xend=1,y=ratio.rank,yend=grad.rate.rank),size=.75) + scale_x_continuous(breaks = c())
p<-p + theme(panel.background = element_blank())
p<-p + theme(panel.grid=element_blank())
p<-p + theme(axis.ticks=element_blank())
p<-p + theme(axis.text=element_blank())
p<-p + theme(panel.border=element_blank())     
p<-p + geom_text(label=ratioRanks$Postal.Code, y=ratioRanks$grad.rate.rank, x=1,hjust=-0.2,size=3.5)
p<-p + geom_text(label=ratioRanks$Postal.Code, y=ratioRanks$ratio.rank, x=0,hjust=1.2,size=3.5)
p<-p + geom_text(label="Revenue Per Student Ranks", x=0, y=(1.05*(max(ratioRanks$grad.rate.rank,ratioRanks$ratio.rank))),hjust= 1.2,size=3.5)
p<-p + geom_text(label="Graduation Rate Ranks", x=1, y=(1.05*(max(ratioRanks$grad.rate.rank,ratioRanks$ratio.rank))),hjust=-0.1,size=3.5)
p

p<-ggplot(revRanks) + geom_segment(aes(x=0,xend=1,y=rev.per.student.rank,yend=grad.rate.rank),size=.75) + scale_x_continuous(breaks = c())
p<-p + theme(panel.background = element_blank())
p<-p + theme(panel.grid=element_blank())
p<-p + theme(axis.ticks=element_blank())
p<-p + theme(axis.text=element_blank())
p<-p + theme(panel.border=element_blank())     
p<-p + geom_text(label=revRanks$Postal.Code, y=revRanks$grad.rate.rank, x=1,hjust=-0.2,size=3.5)
p<-p + geom_text(label=revRanks$Postal.Code, y=revRanks$rev.per.student.rank, x=0,hjust=1.2,size=3.5)
p<-p + geom_text(label="Revenue Per Student Ranks", x=0, y=(1.05*(max(revRanks$grad.rate.rank,revRanks$rev.per.student.rank))),hjust= 1.2,size=3.5)
p<-p + geom_text(label="Graduation Rate Ranks", x=1, y=(1.05*(max(revRanks$grad.rate.rank,revRanks$rev.per.student.rank))),hjust=-0.1,size=3.5)
p

As you can see, the slopes of the lines in all three plots are large (giving the plot a very tangled appearance), which again supports the narrative that none of these three variables is a good predictor of Graduation Rate.

To clean up these slope plots, I decided to remove all of the states except the top 10 and bottom 10 in the first ranking of each plot. This shows the where the best and worst schools by each hypothesis variable end up in the Graduation Rate rankings.

number.of.ranks <- 10

top.and.bottom.ranks <- filter(sizeRanks, size.rank < number.of.ranks | size.rank >= dim(sizeRanks)[1] - number.of.ranks)
p<-ggplot(top.and.bottom.ranks) + geom_segment(aes(x=0,xend=1,y=size.rank,yend=grad.rate.rank),size=.75) + scale_x_continuous(breaks = c())
p<-p + theme(panel.background = element_blank())
p<-p + theme(panel.grid=element_blank())
p<-p + theme(axis.ticks=element_blank())
p<-p + theme(axis.text=element_blank())
p<-p + theme(panel.border=element_blank())  
p<-p + geom_text(label=top.and.bottom.ranks$Postal.Code, y=top.and.bottom.ranks$grad.rate.rank, x=1,hjust=-0.2,size=3.5)
p<-p + geom_text(label=top.and.bottom.ranks$Postal.Code, y=top.and.bottom.ranks$size.rank, x=0,hjust=1.2,size=3.5)
p<-p + geom_text(label="Revenue Per Student Ranks", x=0, y=(1.05*(max(top.and.bottom.ranks$grad.rate.rank,top.and.bottom.ranks$size.rank))),hjust= 1.2,size=3.5)
p<-p + geom_text(label="Graduation Rate Ranks", x=1, y=(1.05*(max(top.and.bottom.ranks$grad.rate.rank,top.and.bottom.ranks$size.rank))),hjust=-0.1,size=3.5)
p

top.and.bottom.ranks <- filter(ratioRanks, ratio.rank < number.of.ranks | ratio.rank >= dim(ratioRanks)[1] - number.of.ranks)
p<-ggplot(top.and.bottom.ranks) + geom_segment(aes(x=0,xend=1,y=ratio.rank,yend=grad.rate.rank),size=.75) + scale_x_continuous(breaks = c())
p<-p + theme(panel.background = element_blank())
p<-p + theme(panel.grid=element_blank())
p<-p + theme(axis.ticks=element_blank())
p<-p + theme(axis.text=element_blank())
p<-p + theme(panel.border=element_blank())  
p<-p + geom_text(label=top.and.bottom.ranks$Postal.Code, y=top.and.bottom.ranks$grad.rate.rank, x=1,hjust=-0.2,size=3.5)
p<-p + geom_text(label=top.and.bottom.ranks$Postal.Code, y=top.and.bottom.ranks$ratio.rank, x=0,hjust=1.2,size=3.5)
p<-p + geom_text(label="Revenue Per Student Ranks", x=0, y=(1.05*(max(top.and.bottom.ranks$grad.rate.rank,top.and.bottom.ranks$ratio.rank))),hjust= 1.2,size=3.5)
p<-p + geom_text(label="Graduation Rate Ranks", x=1, y=(1.05*(max(top.and.bottom.ranks$grad.rate.rank,top.and.bottom.ranks$ratio.rank))),hjust=-0.1,size=3.5)
p

top.and.bottom.ranks <- filter(revRanks, rev.per.student.rank < number.of.ranks | rev.per.student.rank >= dim(revRanks)[1] - number.of.ranks)
p<-ggplot(top.and.bottom.ranks) + geom_segment(aes(x=0,xend=1,y=rev.per.student.rank,yend=grad.rate.rank),size=.75) + scale_x_continuous(breaks = c())
p<-p + theme(panel.background = element_blank())
p<-p + theme(panel.grid=element_blank())
p<-p + theme(axis.ticks=element_blank())
p<-p + theme(axis.text=element_blank())
p<-p + theme(panel.border=element_blank())            
p<-p + geom_text(label=top.and.bottom.ranks$Postal.Code, y=top.and.bottom.ranks$grad.rate.rank, x=1,hjust=-0.2,size=3.5)
p<-p + geom_text(label=top.and.bottom.ranks$Postal.Code, y=top.and.bottom.ranks$rev.per.student.rank, x=0,hjust=1.2,size=3.5)
p<-p + geom_text(label="Revenue Per Student Ranks", x=0, y=(1.05*(max(top.and.bottom.ranks$grad.rate.rank,top.and.bottom.ranks$rev.per.student.rank))),hjust= 1.2,size=3.5)
p<-p + geom_text(label="Graduation Rate Ranks", x=1, y=(1.05*(max(top.and.bottom.ranks$grad.rate.rank,top.and.bottom.ranks$rev.per.student.rank))),hjust=-0.1,size=3.5)
p

Again, we see that the hypothesis variable rankings are not good indicators of where a state stacks up when ranked by Graduation Rate.

Final Plots and Summary

For my final visualizations, I picked 3 plots that show the relationship between Revenue per Student and Graduation Rate. They all fall on different parts of the “Information Density” versus “Accessibility” spectrum. The plots will go in order of increasing information density, meaning we’ll start with the most accessible.

top.and.bottom.ranks <- filter(revRanks, rev.per.student.rank < number.of.ranks | rev.per.student.rank >= dim(revRanks)[1] - number.of.ranks)
p<-ggplot(top.and.bottom.ranks) + geom_segment(aes(x=0,xend=1,y=rev.per.student.rank,yend=grad.rate.rank),size=.75) + scale_x_continuous(breaks = c())
p<-p + theme(panel.background = element_blank())
p<-p + theme(panel.grid=element_blank())
p<-p + theme(axis.ticks=element_blank())
p<-p + theme(axis.text=element_blank())
p<-p + theme(panel.border=element_blank())            
p<-p + geom_text(label=top.and.bottom.ranks$Postal.Code, y=top.and.bottom.ranks$grad.rate.rank, x=1,hjust=-0.2,size=3.5)
p<-p + geom_text(label=top.and.bottom.ranks$Postal.Code, y=top.and.bottom.ranks$rev.per.student.rank, x=0,hjust=1.2,size=3.5)
p<-p + geom_text(label="'Revenue Per Student' Ranks", x=0, y=(1.03*(max(top.and.bottom.ranks$grad.rate.rank,top.and.bottom.ranks$rev.per.student.rank))),hjust= 0.1,size=5)
p<-p + geom_text(label="'Graduation Rate' Ranks", x=1, y=(1.03*(max(top.and.bottom.ranks$grad.rate.rank,top.and.bottom.ranks$rev.per.student.rank))),hjust=--0.9,size=5)
p<-p + xlab("") + ylab("") + ggtitle("State Rankings of Scholastic Performance")
p

One would expect a state’s ranking for ‘Revenue Per Student’ to be a strong indicator of that state’s ranking for ‘Graduation Rate’, but looking at this graph, it’s clear that the variation between the two rankings is very high, with the top 10 and bottom 10 schools in revenue get wildly different rankings for their graduation rate.

map1 <- ggplot() + geom_polygon(data=state_data, aes(x=long, y=lat, group = group, fill=state_data$ALL_RATE),colour="lightgray") + scale_fill_continuous(low = "#EEEEEE", high = "darkred", guide="colorbar", name="Graduation Rate        ") + xlab("") + ylab("") + scale_x_continuous(breaks=c(), limits=c(-130, -65)) + scale_y_continuous(breaks=c()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "white")) + ggtitle("Graduation Rate by State")

map2 <- ggplot() + geom_polygon(data=state_data, aes(x=long, y=lat, group = group, fill=state_data$Total.General.Revenue.Per.Student),colour="lightgray") + scale_fill_continuous(low = "#DDDDDD", high = "darkgreen", guide="colorbar", trans="log10", name="Revenue per Student", breaks=c(10000, 15000, 20000, 25000)) + xlab("") + ylab("") + scale_x_continuous(breaks=c(), limits=c(-130, -65)) + scale_y_continuous(breaks=c()) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(), axis.line = element_line(colour = "white")) + ggtitle("Revenue per Student by State")

grid.arrange(map1, map2)

By showing the Graduation Rates and Revenues per Student for each state, one can see that the states with high graduation rates don’t necessarily have high Revenues per Student and vice versa, further illustrating the lack of correlation between the two variables.

ggplot(aes(x=Total.General.Revenue.Per.Student, y=ALL_RATE), data=data) +
  geom_point(alpha=0.10) + 
  xlim(0, quantile(data$Total.General.Revenue.Per.Student, .99, na.rm=TRUE)) +
  ggtitle("Graduation Rate versus Revenue Per Student") + 
  xlab("Revenue per Student") + ylab("Graduation Rate") +
  geom_smooth(color="red")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 1179 rows containing missing values (stat_smooth).
## Warning: Removed 1179 rows containing missing values (geom_point).

This scatterplot shows the most information of these three plots. While there is a hint of a positive relationship between the two variables, the regression model shows that it is a tenuous relationship at best. In the upper and lower ranges for Revenue per Student, there aren’t many samples, so the variance is large. For the most dense section of the data, there isn’t a clear monotonic relationship, since the regression curve peaks and dips in several spots.

Combined, I believe these three plots demonstrate the lack of correlation between Graduation Rate and Revenue per Student. In my time working on this data exploration, I’ve shared this surprising finding with some of my friends, most of whom were skeptical of my claim since intuition suggests a relationship would exist. I would think that were I to show them these plots, they would be more convinced that the lack of a relationship between the two variables is, in deed, true.

Reflection

Conclusions

At every step of my analysis of the hypothesis variables and their relationship to Graduation Rates, I became more convinced of the lack of influence that they had on Graduation Rates. Even though my initial intuition was that they would be very important, the data did not show this relationship. The best data analysis confronts our incorrect assumptions, so in that sense, my analysis was very successful.

I found it very difficult to visualize the lack of a relationship as opposed to visualizing the presence of a relationship, but I think that, even though not all of my attempts were successful, my analysis as a whole showed this lack of relationship.

I think the most successful graphs were the scatterplots with regression models since they showed each hypothesis variable and its complicated relationship with Graduation Rate.

Also, I was really happy with my multivariate analysis where I highlighted each state one by one. These plots combined all of my hypothesis variables into one graph per state, and comparing these plots showed how you couldn’t really make a guess at how the state would rank with respect to its average Graduation Rate just giving its distribution on that plot. One thing I would do if I continued work on this project would be to make an interactive version of these plots, allowing a user to select which state to highlight.

Further Analysis

Additionally, if I had more time to devote ot this analysis, I would like to examine state by state education laws that might account for some of the difference I observed. For example, some schools have incentives for schools to keep their class sizes under a certain threshold, which would affect that state’s Pupil/Teacher Ratio.

Also, there are many other features of each school that might be better predictors than the three hypothesis variables I chose, so I would love to look at those other features to see if some combination of them explained why a particular school did well at getting students to graduate.

Lessons

  1. When finding a data set, raw data is much easier to work with than aggregate data (graduation rate ranges in my case due to FERPA)
  2. Check important variables at the beginning of your EDA, because sometimes the values are very off, and you need to figure out what’s going wrong before you start incorporating those variables into visualizations and summaries. I was actually very far into my analysis before I realized that I was computing Revenue per Student incorrectly, which caused me to have to redo some of my analysis.
  3. Showing the lack of a relationship is harder than showing the presence of a relationship, but I still find that my analysis was worthwhile because the unlearning of incorrect relationships is just as important as the learning of true relationships.
  4. A lot of domain knowledge is needed to do a proper analysis of the data. I learned so much about the education system in this analysis, and I’m sure there are many things I don’t know about the U.S. education system that would have greatly improved my analysis. I look forward to further researching this domain to be better able to interpret its data.